Take-home Exercise 1: Geospatial Analytics for Public Good

Published

November 21, 2023

Modified

December 2, 2023

Introduction

Digitization of urban infrastructure has introduced new datasets that track movement patterns across space and time through GPS and RFID. Understanding these datasets can potentially lead to more informed decision-making in urban management and planning.

The purpose of this exercise is to perform Exploratory Spatial Data Analysis (ESDA), via appropriate Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA), to uncover the spatial and spatio-temporal mobility patterns of public bus passengers in Singapore.

The TASK

Geovisualisation and Analysis

  • With reference to the time intervals provided in the table below, compute the passenger trips generated by origin at the hexagon level,

    Peak hour period             Station entry time
    Weekday morning peak         6am to 9am
    Weekday afternoon peak       5pm to 8pm
    Weekend/holiday morning peak 11am to 2pm
    Weekend/holiday evening peak 4pm to 7pm
  • Display the geographical distribution of the passenger trips by using appropriate geovisualisation methods,

  • Describe the spatial patterns revealed by the geovisualisation (not more than 200 words per visual).

Local Indicators of Spatial Association (LISA) Analysis

  • Compute LISA of the passenger trips generated by origin at the hexagon level.

  • Display the LISA maps of the passenger trips generated by origin at the hexagon level. The maps should only display the significant results (i.e. p-value < 0.05).

  • With reference to the analysis results, draw statistical conclusions (not more than 200 words per visual).

Emerging Hot Spot Analysis (EHSA)

With reference to the passenger trips by origin at the hexagon level for the four time intervals given above:

  • Perform Mann-Kendall Test by using the spatio-temporal local Gi* values,

  • Prepare EHSA maps of the Gi* values of the passenger trips by origin at the hexagon level. The maps should only display the significant values (i.e. p-value < 0.05).

  • With reference to the EHSA maps and data visualisation prepared, describe the spatial patterns revealed (not more than 250 words per cluster).

I will be focusing on Task 1: Geovisualisation and Analysis and Task 3: Emerging Hot Spot Analysis in this exercise.

Setting up

# load the necessary packages
pacman::p_load(tmap, sf, tidyverse, spdep, sfdep, DT)

Geospatial Data Wrangling

Data Import, Extraction and Processing

Aspatial Data

Passenger Volume by Origin Destination Bus Stops from LTA DataMall

First, download the raw data and import it into the R environment using read_csv().

odbs <- read_csv("data/aspatial/origin_destination_bus_202308.csv")
# choosing the August 2023 data set

Next, we will need to clean up the data set, first by converting the bus stop codes to factors for easy sorting, filtering and grouping:

odbs$ORIGIN_PT_CODE <- as.factor(odbs$ORIGIN_PT_CODE)
odbs$DESTINATION_PT_CODE <- as.factor(odbs$DESTINATION_PT_CODE)
# This changes the data type from chr to factor in both fields

The data required are:

  • Weekday morning peak (6am to 9am)

  • Weekday afternoon peak (5pm-8pm)

  • Weekend/Holiday morning peak (11am to 2pm)

  • Weekend/Holiday evening peak (4pm to 7pm)

We will next filter odbs to get the required data per the above:

odbs_peak <- odbs %>% 
  filter((DAY_TYPE == "WEEKDAY" & 
           ((TIME_PER_HOUR >=6 & TIME_PER_HOUR <=9) | 
              (TIME_PER_HOUR >=17 & TIME_PER_HOUR <=20))) | 
           DAY_TYPE == "WEEKENDS/HOLIDAY" & 
           ((TIME_PER_HOUR >=11 & TIME_PER_HOUR <=14) |
              (TIME_PER_HOUR >=16 & TIME_PER_HOUR <=19))) %>%
  group_by(ORIGIN_PT_CODE) %>% # group by origin to extract all trips generated
  summarize(TRIPS = sum(TOTAL_TRIPS)) # derive a new field to aggregate the total trips
# take a look at the data table
datatable(odbs_peak)

To preserve the data set, we can save the output in rds format for future use:

write_rds(odbs_peak, "data/rds/odbs_peak.rds")

To import it back into the R environment:

odbs_peak <- read_rds("data/rds/odbs_peak.rds")

Geospatial Data

  1. Bus Stop Location from LTA DataMall
busstop <- st_read(dsn = "data/geospatial",
                     layer = "BusStop") %>% 
    st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex1\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
  # st_transform() converts the geometry into the specified coordinate reference system; EPSG 3414 is the projected coordinate system for Singapore (SVY21)
  2. Hexagon: a hexagon layer of 250m (this distance is the perpendicular distance between the centre of the hexagon and its edges) will be used to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA.

Creating Hexagon

busstop_hexagon_grid = st_make_grid(busstop, c(250, 250), what = "polygons", square = FALSE)

hexagon_grid_sf = st_sf(geometry = busstop_hexagon_grid) %>%
  # add a grid ID
  mutate(grid_id = seq_along(busstop_hexagon_grid))
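One thing worth double-checking: per the sf documentation, the cellsize argument of st_make_grid() for hexagonal cells is the distance between opposite edges, i.e. twice the centre-to-edge (perpendicular) distance. A cellsize of 250 therefore gives a 125 m apothem; if a 250 m centre-to-edge distance is intended, cellsize = 500 may be needed. A quick check on a toy extent (the coordinates below are illustrative, not the bus stop data):

```r
library(sf)

# toy extent in a projected CRS (metres); coordinates are illustrative only
pts <- st_sfc(st_point(c(0, 0)), st_point(c(2000, 2000)), crs = 3414)
hex <- st_make_grid(pts, cellsize = 250, square = FALSE)

# the shorter side of one cell's bounding box is the flat-to-flat
# (opposite-edge) distance, regardless of hexagon orientation
bb <- st_bbox(hex[1])
flat_to_flat <- as.numeric(min(bb["xmax"] - bb["xmin"], bb["ymax"] - bb["ymin"]))
apothem <- flat_to_flat / 2
flat_to_flat  # the cellsize is the opposite-edge distance
apothem       # centre-to-edge distance, half the cellsize
```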

Data Cleaning and Transformation

We will next use st_intersection() to overlay the points on the polygons and combine the two data sets. The output is a point sf object.

hexagon <- st_intersection(hexagon_grid_sf, busstop) %>%
  select(1,2,4) %>%
  st_drop_geometry()
write_rds(hexagon, "data/rds/hexagon.rds")

Now we have got three data sets, namely

  1. odbs_peak : The total number of trips from the Origin Bus Stop during the specified peak period.

  2. busstop : The details of bus stop in Singapore, with the location.

  3. hexagon: A hexagon layer of 250m with bus stop location.

Performing the Relational Join (to update the attribute table of a geospatial data set with an aspatial data set)

We will now join this onto our odbs_peak data frame, which shows the total number of trips from each bus stop during peak hours.

obs_peak <- left_join(odbs_peak, hexagon,
                      by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
  rename(ORIGIN_BS = ORIGIN_PT_CODE) %>%
  group_by(grid_id) %>% # group by grid_id
  summarize(TOTAL_TRIPS = sum(TRIPS)) # aggregate the total trips per grid_id

Next, as a data sanity check, we look for duplicate records:

duplicate <- obs_peak %>%
  group_by_all() %>%
  filter(n()>1) %>%
  ungroup()

We noted there are duplicate records, so we retain only the unique rows:

obs_peak <- unique(obs_peak)

This removed duplicate records from the data frame; all remaining records are unique.

However, upon closer inspection, we noted that some records still look duplicated:

datatable(obs_peak)

Per the data table above, we can see that grid_id 4 and grid_id 128 share the same TOTAL_TRIPS; on closer inspection, bus stop 25059 has point coordinates that sit on the shared edge between hexagon 4 and hexagon 128.

par(mfrow = c(1, 2))
plot(hexagon_grid_sf[[1]][[4]])
points(3970.122, 28063.28)
plot(hexagon_grid_sf[[1]][[128]])
points(3970.122, 28063.28)

We will keep the data sitting on the hexagon boundary as it is, for now.
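An alternative to keeping both copies is to assign each boundary bus stop to exactly one hexagon, e.g. by keeping only the first matching cell per stop with distinct(). A minimal sketch on a toy grid (the coordinates and the BUS_STOP_N value are illustrative, not the actual data):

```r
library(sf)
library(dplyr)

# toy hexagon grid; coordinates are illustrative only
grid <- st_sf(geometry = st_make_grid(
    st_sfc(st_point(c(0, 0)), st_point(c(1000, 1000)), crs = 3414),
    cellsize = 250, square = FALSE)) %>%
  mutate(grid_id = seq_len(n()))

# take a vertex shared between two touching cells: it lies exactly on the
# boundary of both, like bus stop 25059 on the edge of hexagons 4 and 128
nbr <- st_touches(grid)[[1]][1]
shared <- st_intersection(grid$geometry[1], grid$geometry[nbr])
pt <- st_cast(shared, "POINT")[1]

stop_sf <- st_sf(BUS_STOP_N = "X1", geometry = pt)  # hypothetical bus stop
dup <- st_intersection(grid, stop_sf)    # the stop matches more than one cell
dedup <- dup %>% distinct(BUS_STOP_N, .keep_all = TRUE)  # keep the first match
nrow(dup)    # more than one row
nrow(dedup)  # one row
```

The same distinct() call would work on the real hexagon object keyed by BUS_STOP_N, at the cost of arbitrarily assigning boundary stops to whichever cell matches first.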

Next, we join our obs_peak data frame onto the hexagon_grid_sf data frame to obtain the total number of trips per origin hexagon during the peak periods.


hexagon_obs_peak <- left_join(hexagon_grid_sf, obs_peak,
            by = "grid_id") %>%
  drop_na("TOTAL_TRIPS","grid_id")  # remove entries with NA in TOTAL_TRIPS and grid_id
tmap_mode("view")
tmap_options(check.and.fix = TRUE)
map_honeycomb = tm_shape(hexagon_obs_peak) +
  tm_fill(
    col = "TOTAL_TRIPS",
    palette = "Reds",
    style = "cont",
    title = "Number of Trips",
    alpha = 0.6,
    popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
    popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)+
  tm_scale_bar()

map_honeycomb_q = tm_shape(hexagon_obs_peak)+ 
  tm_fill("TOTAL_TRIPS", 
          style = "quantile", 
          palette = "Reds",
          title = "Number of Trips (Quantile)") +
  tm_layout(main.title = "Passenger Trips during Peak Hours",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE) +
  tm_borders(alpha = 0.3) +
  tm_scale_bar() +
  tm_grid(alpha =0.2)

tmap_arrange(map_honeycomb, map_honeycomb_q, asp=1, ncol=2)

Next, we take a closer look at the hexagon areas with more than 100k total trips:

tmap_mode("view")
tmap_options(check.and.fix = TRUE)
map_honeycomb = tm_shape(hexagon_obs_peak %>%
                         filter(TOTAL_TRIPS>100000)) +
  tm_fill(
    col = "TOTAL_TRIPS",
    palette = "Reds",
    style = "cont",
    title = "Number of TRIPS",
    id = "grid_id",
    alpha = 0.6,
    popup.vars = c("Number of TRIPS: " = "TOTAL_TRIPS"),
    popup.format = list(TOTAL_TRIPS = list(format = "f", digits = 0))) +
  tm_borders(col = "grey40", lwd = 0.7)


map_honeycomb

Own Analysis

The above map shows that the areas with more than 100k total trips during peak hours are concentrated along key residential areas, or in areas likely served by direct bus routes to the city / CBD (for those commuting to school or work during peak hours).

It is also worth noting that the areas near the Singapore-Malaysia Causeway also have a high number of total trips during the peak periods, likely due to people residing in Malaysia heading to Singapore for work or study, and likely contributed mostly by the weekday peak periods.

Emerging Hot Spot Analysis (EHSA)

The objective is to:

  • Perform Mann-Kendall Test by using the spatio-temporal local Gi* values,

  • Prepare EHSA maps of the Gi* values of the passenger trips by origin at the hexagon level. The maps should only display the significant values (i.e. p-value < 0.05).

  • With reference to the EHSA maps and data visualisation prepared, describe the spatial patterns revealed (not more than 250 words per cluster).

First we need to create the Time Series Cube:
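The EHSA steps themselves are not shown in this post, so here is a minimal sketch of the sfdep workflow on toy data (it requires the Kendall package to be installed). The hex layer, the trips_by_time table, and its column names (grid_id, period, TRIPS) are assumptions standing in for the actual hexagon layer and peak-period trip counts:

```r
library(sf)
library(sfdep)
library(dplyr)
library(tidyr)

set.seed(1234)

# toy hexagon layer and a complete location x time table of trips (assumptions)
hex <- st_sf(geometry = st_make_grid(
    st_sfc(st_point(c(0, 0)), st_point(c(1500, 1500)), crs = 3414),
    cellsize = 250, square = FALSE)) %>%
  mutate(grid_id = seq_len(n()))
trips_by_time <- expand_grid(grid_id = hex$grid_id, period = 1:6) %>%
  mutate(TRIPS = rpois(n(), 50))

# 1. build the spacetime cube
spt <- spacetime(trips_by_time, hex,
                 .loc_col = "grid_id", .time_col = "period")

# 2. contiguity neighbours and inverse-distance weights (self included for Gi*)
spt_nb <- spt %>%
  activate("geometry") %>%
  mutate(nb = include_self(st_contiguity(geometry)),
         wt = st_inverse_distance(nb, geometry, scale = 1, alpha = 1)) %>%
  set_nbs("nb") %>%
  set_wts("wt")

# 3. local Gi* per time slice
gi_stars <- spt_nb %>%
  group_by(period) %>%
  mutate(gi_star = local_gstar_perm(TRIPS, nb, wt, nsim = 99)) %>%
  unnest(gi_star)

# 4. Mann-Kendall trend test on each hexagon's Gi* series
mk <- gi_stars %>%
  ungroup() %>%
  group_by(grid_id) %>%
  summarise(mk = list(unclass(Kendall::MannKendall(gi_star)))) %>%
  unnest_wider(mk)

# 5. emerging hot spot analysis, keeping significant results only
ehsa <- emerging_hotspot_analysis(spt, .var = "TRIPS", k = 1, nsim = 99)
ehsa_sig <- filter(ehsa, p_value < 0.05)
```

From mk (where sl is the Mann-Kendall p-value) and ehsa_sig, the significant hexagons can then be joined back to hexagon_grid_sf by grid_id and mapped with tmap, as in the earlier choropleths.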

Additional

Note: This is purely for my own curiosity.

mpsz <- st_read(dsn="data/geospatial",layer="MPSZ-2019") %>% 
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\cftoh\ISSS624\Take-home_Ex\Take-home_Ex1\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
busstop_mpsz <- st_intersection(busstop, mpsz) %>%
  select(BUS_STOP_N, SUBZONE_C) %>%
  st_drop_geometry()
obs_peak <- left_join(odbs_peak, busstop_mpsz,
                      by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
  rename(ORIGIN_BS = ORIGIN_PT_CODE,
         ORIGIN_SZ = SUBZONE_C)

obs_peak <- unique(obs_peak)
mpsz_obs_peak <- left_join(mpsz, 
                           obs_peak,
                           by = c("SUBZONE_C" = "ORIGIN_SZ"))

Preparing a choropleth map showing the distribution of passenger trips at planning sub-zone level:

mpsz_obs_peak %>%
  drop_na(TRIPS)
Simple feature collection with 5013 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 21448.47 xmax: 55941.94 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
          SUBZONE_N SUBZONE_C      PLN_AREA_N PLN_AREA_C       REGION_N
1  INSTITUTION HILL    RVSZ05    RIVER VALLEY         RV CENTRAL REGION
2  INSTITUTION HILL    RVSZ05    RIVER VALLEY         RV CENTRAL REGION
3    ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
4    ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
5    ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
6    ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
7    ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
8    ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
9    ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
10   ROBERTSON QUAY    SRSZ01 SINGAPORE RIVER         SR CENTRAL REGION
   REGION_C ORIGIN_BS TRIPS                       geometry
1        CR     13089  7899 MULTIPOLYGON (((28481.45 30...
2        CR     13099 15531 MULTIPOLYGON (((28481.45 30...
3        CR     04321 10466 MULTIPOLYGON (((28087.34 30...
4        CR     06129  7568 MULTIPOLYGON (((28087.34 30...
5        CR     06151  6104 MULTIPOLYGON (((28087.34 30...
6        CR     06159 11722 MULTIPOLYGON (((28087.34 30...
7        CR     06169 14622 MULTIPOLYGON (((28087.34 30...
8        CR     13079  5238 MULTIPOLYGON (((28087.34 30...
9        CR     13109 16274 MULTIPOLYGON (((28087.34 30...
10       CR     13119  3949 MULTIPOLYGON (((28087.34 30...
tm_shape(mpsz_obs_peak)+ 
  tm_fill("TRIPS", 
          style = "quantile", 
          palette = "Reds",
          title = "Passenger trips") +
  tm_layout(main.title = "Passenger Trips during Peak Hours",
            main.title.position = "center",
            main.title.size = 1.2,
            legend.height = 0.45, 
            legend.width = 0.35,
            frame = TRUE) +
  tm_borders(alpha = 0.3) +
  tm_compass(type="8star", size = 2) +
  tm_scale_bar() +
  tm_grid(alpha =0.2) +
  tm_credits("Source: Planning Sub-zone boundary from Urban Redevelopment Authority (URA)\n and passenger trip data from LTA DataMall", 
             position = c("left", "bottom"))

As an extra step, we look at the distribution of the total number of trips during the peak periods, broken down by region.

agg_mpsz_obs_peak_byzones <- mpsz_obs_peak %>%
  filter(!is.na(TRIPS)) %>%
  group_by(REGION_N) %>%
  summarize(TRIPS = sum(TRIPS)) %>%
  st_drop_geometry()
library(scales)
ggplot(agg_mpsz_obs_peak_byzones) +
  geom_bar(aes(x=REGION_N, 
               y=TRIPS,
               fill=REGION_N),
           stat="identity") +
  scale_y_continuous(labels = label_number(suffix = "M", scale = 1e-6)) +
  labs(title="Number of TRIPS during Peak Period",
       subtitle="(by Region)",
       y="Number of Trips",
       x="Region")+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.line.y=element_line(colour="grey50"),
        legend.text = element_text(size = 8))

Although we noted from the earlier analysis that the highest number of trips originated from the West Region, the bar chart above shows that the Central Region still has the highest total number of trips. Breaking it down further, the counts below show a significantly larger number of bus stops in the Central Region, which is likely the main factor behind its highest aggregate trip count across the regions. Another deduction is that commuters in the central area may take shorter bus rides, or make more transfers within the central region, than those boarding in other regions.

table(mpsz_obs_peak$REGION_N)

   CENTRAL REGION       EAST REGION NORTH-EAST REGION      NORTH REGION 
             1404               820               792               677 
      WEST REGION 
             1339 

What if we do the analysis at the subzone level?

agg_mpsz_obs_peak_bysubzones <- mpsz_obs_peak %>%
  filter(!is.na(TRIPS)) %>%
  group_by(SUBZONE_N) %>%
  summarize(TRIPS = sum(TRIPS)) %>%
  st_drop_geometry()

We can see there are a total of 313 subzones once aggregated.

library(scales)
ggplot(agg_mpsz_obs_peak_bysubzones) +
  geom_bar(aes(x=SUBZONE_N, 
               y=TRIPS),
           stat="identity") +
  scale_y_continuous(labels = label_number(suffix = "M", scale = 1e-6)) +
  labs(title="Number of TRIPS during Peak Period",
       subtitle="(by Subzone)",
       y="Number of Trips",
       x="Subzone")+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.line.y=element_line(colour="grey50"),
        legend.text = element_text(size = 8))

top_n(agg_mpsz_obs_peak_bysubzones, n=10, TRIPS) %>%
          ggplot(., aes(x=SUBZONE_N, 
                        y=TRIPS,
                        fill=SUBZONE_N))+
              geom_bar(stat='identity') +
  scale_y_continuous(labels = label_number(suffix = "M", scale = 1e-6)) +
  labs(title="Top 10 Highest number of TRIPS during Peak Period",
       subtitle="(sorted by SubZones)",
       y="Number of Trips",
       x="SubZones")+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.line.y=element_line(colour="grey50"),
        legend.text = element_text(size = 8))